# Analyzing Political Text at Scale with Online Tensor LDA (TLDA)
Replication package README

This repository contains scripts and instructions to reproduce the experimental pipelines used in **Analyzing Political Text at Scale with Online Tensor LDA**. The code builds on the TensorLy Python library, which has incorporated the TLDA implementation written by the authors of the paper. The software processes large-scale social media text end to end and estimates topic models with moment-based tensor methods on GPU.

## Technical References
- TensorLy: a high-level Python library for tensor learning and tensor algebra with pluggable backends (NumPy, PyTorch, CuPy, etc.). It provides tensor operations and decompositions used throughout these experiments. See: https://tensorly.org/stable/index.html
- TLDA: TensorLy’s experimental implementations and utilities for Tensor LDA and related decomposition routines used here for scalable, moment-based topic modeling. See: https://tensorly.org/tlda/dev/

## Data availability and Terms of Use
Due to the original platforms’ Terms of Use governing data acquisition at the time of collection, **this replication package does not distribute raw social-media content**. All code is included, but the scripts ship with pre-set configurations that produce and consume **aggregated data only**: document–term matrices, aggregate statistics, intermediate objects (e.g., whitened cumulants, factor estimates), vocabularies, topic loadings, and derived diagnostics.

## Overview of experiments and archives
We provide three experiment tracks. Each track is structured to run in order and may be treated as a separate archive:

1. Covid Tweets
2. MeToo Tweets
3. Election2020 Tweets

Each track flows as:
- **A. Pipeline & model fitting**: parse inputs, build vocabulary, construct document–term matrices (DTMs), whiten, fit TLDA, and save intermediate objects.
- **B. Model export & top-words**: export unwhitened TLDA factors to CSV, and generate top-words lists/word clouds.
- **C. Validation & coherence**: compute coherence and related diagnostics (e.g., u_mass) and, where applicable, compare against baselines (scikit-learn LDA, Gensim LDA, PARAFAC).

## Hardware and software requirements
- NVIDIA GPU with CUDA (required to run the full archive). The pipelines use GPU-native libraries (CuPy, cuDF, cuML, cupyx) and will be significantly faster on GPU.
- Python ≥ 3.9 recommended.
- CUDA-compatible CuPy (e.g., `cupy-cuda12x` or `cupy-cuda11x` via pip; choose the build matching your CUDA toolchain).
- RAPIDS libraries via conda (recommended): `cudf`, `cuml`. (Install versions compatible with your CUDA and Python.)
- Other dependencies (installed via pip/conda):
  - `tensorly`, `scikit-learn`, `scipy`, `numpy`, `pandas`, `matplotlib`, `wordcloud`, `gensim`, `nltk`
- NLTK resources (one-time):  
  ```bash
  python -c "import nltk; nltk.download('stopwords')"
  ```

### Suggested conda environment (template)
> Adjust package versions to match your CUDA setup.
```bash
conda create -n tlda-repl python=3.10 -y
conda activate tlda-repl

# RAPIDS (example; consult RAPIDS install matrix for exact versions)
# conda install -c rapidsai -c conda-forge -c nvidia cudf cuml cupy -y

# Alternatively install CuPy wheels that match your CUDA:
# pip install cupy-cuda12x

pip install tensorly scipy numpy pandas scikit-learn gensim matplotlib wordcloud nltk
```
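
Before running a pipeline, it can help to confirm which dependencies are actually importable in the active environment. The following is a minimal sketch, not part of the replication scripts: the helper name `missing_modules` is hypothetical, and the module lists simply restate the dependencies named above (note that `scikit-learn` is imported as `sklearn`).

```python
import importlib.util

# Import names taken from the dependency list above. The GPU entries
# (cupy, cudf, cuml) are expected to be missing on CPU-only machines.
REQUIRED = ["tensorly", "sklearn", "scipy", "numpy", "pandas",
            "matplotlib", "wordcloud", "gensim", "nltk"]
GPU_ONLY = ["cupy", "cudf", "cuml"]


def missing_modules(names):
    """Return the subset of `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("missing core packages:", missing_modules(REQUIRED) or "none")
    print("missing GPU packages: ", missing_modules(GPU_ONLY) or "none")
```

The check only locates packages; it does not verify that the installed CuPy build matches your CUDA toolchain.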

## Repository layout (key scripts)
The replication package includes the following top-level scripts. Many rely on internal modules under the project (e.g., `version0_20`, `version0_15`, `version0_99`, or `lda/tlda` subpackages). Ensure the project root is on your `PYTHONPATH` (e.g., `export PYTHONPATH=$PYTHONPATH:$PWD`).

### A. Pipelines & model fitting
- `01_covid_pipeline.py`  
  End-to-end Covid experiment: builds a domain-tailored vocabulary, constructs DTMs using GPU `CountVectorizer` (cuML) with unigrams and bigrams, whitens with PCA, fits TLDA, and persists outputs. A parameter grid in `main()` iterates over topic counts and TLDA hyperparameters.

- `02_metoo_pipeline.py`  
  Mirrors the Covid pipeline for the MeToo corpus, with tailored stopwords and identical output structure.

- `01_EF_batched_validation.py`  
  Election2020/Election-Fraud pipeline with batched processing and validation utilities. Uses `cupyx.scipy.sparse` CSR batches and computes coherence/diagnostics across concatenated batches.

### B. Export & top-words
- `02_tlda_export.py`  
  Loads a fitted TLDA object (`tlda.obj`), computes unwhitened factors, and exports topic–word matrices to CSV (`tlda.csv`) for downstream analysis (e.g., R).

- `03_covid_top_words.py`  
  Loads TLDA factors and `vocab.csv`, extracts top-N words per selected topics, saves a topic–topwords CSV, and generates word clouds (via `wordcloud`).
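
The top-N extraction described above amounts to sorting each topic's word weights. The sketch below illustrates the idea with NumPy on a toy matrix; it is not the code in `03_covid_top_words.py`. The function name `top_words`, the topics-by-vocabulary orientation, and the example weights are assumptions made for illustration.

```python
import numpy as np


def top_words(topic_word, vocab, n=10):
    """Return the n highest-weighted vocabulary entries for each topic.

    topic_word: array of shape (n_topics, vocab_size)
    vocab: sequence of tokens, in the same order as the columns
    """
    topic_word = np.asarray(topic_word)
    # argsort is ascending, so reverse each row and keep the first n indices
    order = np.argsort(topic_word, axis=1)[:, ::-1][:, :n]
    return [[vocab[j] for j in row] for row in order]


# Toy example with made-up weights (not output from the experiments)
vocab = ["mask", "vaccine", "school", "vote"]
weights = np.array([[0.5, 0.3, 0.1, 0.1],
                    [0.1, 0.1, 0.2, 0.6]])
print(top_words(weights, vocab, n=2))
```

If the exported factor matrix is stored as vocabulary-by-topics instead, transpose it before calling the helper.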

### C. Coherence & comparisons
- `03_coherence_comparison.py` and `04_coherence_comparison_large.py`  
  Compute coherence (e.g., `u_mass`) and cosine-similarity diagnostics; compare TLDA to baselines: scikit-learn LDA, Gensim LDA, and symmetric PARAFAC. These scripts also include utilities for preprocessing and evaluation on benchmark corpora (e.g., 20 Newsgroups) and real datasets.

## Constants, inputs, and outputs
Most pipelines define these constants near the top (adjust per dataset):

- **Roots and inputs**  
  - `ROOT_DIR`: dataset root (e.g., `/path/to/PA-Replication/Covid/data/`).  
  - `INDIR`: subfolder containing pre-split input files (e.g., `split_files/`). Large datasets are processed in chunks.  
  - Optional dataset-specific prefixes for labels, dates, or partitions (e.g., `data/data_split{2,3,4}/` in Election2020).

- **Outputs (relative to the experiment run directory)**  
  - `x_mat/`: batched sparse DTMs (`.obj` pickles), typically CUDA-friendly CSR via `cupyx`.  
  - `corpus/` and `corpus.obj`: Gensim corpus artifacts when used.  
  - `countvec.obj`: fitted `CountVectorizer`.  
  - `tlda.obj`: fitted TLDA model (includes whitened and unwhitened factors).  
  - `vocab.csv`: vocabulary (one token per row).  
  - `dtm.csv` and `dtm_df.csv`: document–topic matrices and their metadata-joined versions (elsewhere in this README, "DTM" abbreviates document–term matrix).  
  - `coherence.obj`: serialized coherence diagnostics.  
  - `predicted_topics/`: per-document topic assignments (if exported).  
  - `ids/`: document IDs aligned with DTMs.  
  - `top_words.csv`: top words per topic (if generated).  
  - `alpha_weights.txt`: (when produced) per-topic weights.

> All inputs and outputs are **aggregated** representations only.
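
For inspecting a finished run, the artifacts listed above can be read back with the standard library. This is a hedged sketch: the helper names are hypothetical, `vocab.csv` is assumed to have no header row, and unpickling `tlda.obj` or `countvec.obj` additionally requires that the classes they were saved from (for example the internal TLDA modules and cuML) are importable. Only unpickle files you produced yourself or otherwise trust.

```python
import csv
import pickle
from pathlib import Path


def load_pickle(path):
    """Load one pickled artifact (for example coherence.obj)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def load_vocab(path):
    """Read vocab.csv, assuming one token per row in the first column."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [row[0] for row in csv.reader(handle) if row]


def list_run_artifacts(run_dir):
    """Return the names of files and folders present in one run directory."""
    return sorted(p.name for p in Path(run_dir).iterdir())
```

A typical call would be `load_vocab("path/to/run/vocab.csv")`, with the run directory substituted for your own.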

## Text preprocessing
- GPU `CountVectorizer` (`cuml.feature_extraction.text.CountVectorizer`) with `ngram_range=(1,2)` and optional `max_df` controls.  
- NLTK stopwords augmented with **domain-specific stop lists** (Covid, MeToo, Election2020) tailored to conversational tokens, hashtags, and function words.  
- Optional stemming (`PorterStemmer`) and custom tokenization (`preprocess_efficient` utilities).
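
To make the `ngram_range=(1,2)` setting concrete, the following pure-Python sketch counts unigrams and bigrams for a single document after stopword removal. It is an illustration only and does not call cuML; the function name `ngram_counts`, the whitespace tokenizer, and the four-word stop list are assumptions, not the pipelines' actual settings.

```python
from collections import Counter

# Hypothetical miniature stop list; the pipelines combine NLTK stopwords
# with their own domain-specific lists.
STOPWORDS = {"the", "a", "is", "rt"}


def ngram_counts(text, stopwords=STOPWORDS):
    """Count unigrams and bigrams in one document after stopword removal."""
    tokens = [t for t in text.lower().split() if t not in stopwords]
    # Bigrams are formed from adjacent tokens that survive the stop list
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)


counts = ngram_counts("RT the vaccine rollout is a vaccine success")
print(counts["vaccine"], counts["vaccine rollout"])  # prints: 2 1
```

Each row of a document–term matrix is one such count vector, laid out over a shared vocabulary.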

## TLDA training
- Whitening via PCA (custom `PCA` in `version0_20` or `IncrementalPCA` for baselines).  
- TLDA hyperparameters (set in each pipeline’s `main()`/`fit_topics`):
  - `num_tops` (K): number of topics.
  - `alpha_0`: Dirichlet mass for topics.
  - `learning_rate`: optimizer step size.
  - `theta_param`, `ortho_loss_param`: orthogonality/regularization controls.
  - `n_eigenvec`: whitening dimension (often ∝ K).
  - `initialize_first_docs`: initialize factors using early-document moments for stability.
- Seeds: `cp.random.seed(1000)` (CuPy) / `np.random.seed(1000)` (NumPy).
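
The whitening step can be pictured as projecting the centered counts onto the top `n_eigenvec` principal directions and rescaling them to unit variance. The sketch below shows plain PCA whitening in NumPy on synthetic data. It is not the custom `PCA` class in `version0_20`, it omits any `alpha_0`-dependent adjustment that the TLDA moment construction may apply, and the function name `pca_whitening_matrix` is hypothetical.

```python
import numpy as np


def pca_whitening_matrix(x, n_eigenvec):
    """Return W (vocab_size x n_eigenvec) so the whitened data has identity covariance."""
    centered = x - x.mean(axis=0)
    second_moment = centered.T @ centered / x.shape[0]
    eigvals, eigvecs = np.linalg.eigh(second_moment)   # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_eigenvec]       # indices of the largest ones
    return eigvecs[:, top] / np.sqrt(eigvals[top])     # scale each direction to unit variance


rng = np.random.default_rng(1000)  # seed value 1000, as in the pipelines
# Synthetic count data: 500 documents over a 5-word vocabulary
x = rng.poisson(lam=[3.0, 1.0, 0.5, 2.0, 1.5], size=(500, 5)).astype(float)

w = pca_whitening_matrix(x, n_eigenvec=3)
whitened = (x - x.mean(axis=0)) @ w
cov = whitened.T @ whitened / x.shape[0]
print(np.allclose(cov, np.eye(3)))
```

Reducing `n_eigenvec` shrinks the whitened dimension, which is why it is one of the first settings to lower when memory is tight.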

## Coherence and diagnostics
- UMass coherence via Gensim (`gensim.models.CoherenceModel`) using the learned topics and the lemmatized texts/dictionary.
- Cosine similarity between topics and baseline decompositions for robustness checks.
- Memory-aware batching using `cupyx.scipy.sparse` with explicit GPU memory-pool resets.
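
For readers unfamiliar with UMass coherence, the sketch below computes it from document co-occurrence counts in pure Python, using the add-one smoothing of the original definition. It is a simplified illustration rather than the Gensim `CoherenceModel` code path used by the scripts, so absolute values may differ from the reported ones. The function name `u_mass` and the toy documents are assumptions, and every top word must occur in at least one document.

```python
import math


def u_mass(top_words, documents):
    """UMass coherence for one topic.

    top_words: words ordered from most to least probable
    documents: iterable of token collections (one per document)
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words
        return sum(all(w in doc for w in words) for doc in doc_sets)

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            # +1 smoothing keeps the logarithm finite when a pair never co-occurs
            score += math.log((doc_freq(top_words[i], top_words[j]) + 1)
                              / doc_freq(top_words[j]))
    return score


docs = [["mask", "vaccine"], ["mask"], ["vote"]]
print(u_mass(["mask", "vaccine"], docs))  # log((1 + 1) / 2) = 0.0
```

Scores closer to zero indicate that a topic's top words tend to appear in the same documents.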

## Running the pipelines (order)
The flow for each track is shown below. Replace `ROOT_DIR` and other constants as needed.

### Covid
```bash
# A. Fit TLDA and produce artifacts
python 01_covid_pipeline.py

# B. Export TLDA factors to CSV for external analysis
python 02_tlda_export.py  # edit the hardcoded tlda.obj path inside if needed

# B (optional). Top words and word clouds
python 03_covid_top_words.py

# C. Coherence/comparisons (optional)
python 03_coherence_comparison.py
python 04_coherence_comparison_large.py
```

### MeToo
```bash
# A. Fit TLDA and produce artifacts
python 02_metoo_pipeline.py

# C. Coherence/comparisons (optional)
python 03_coherence_comparison.py
python 04_coherence_comparison_large.py
```

The following helper scripts extend the core pipelines with paper-ready tables and figures for the MeToo track:

### `05_generate_tables.py`
Builds aggregate tables for the paper (e.g., coherence, topic diagnostics, dataset stats).

- **Inputs (examples detected):** `data/seeds2.txt`, `data/synthetic_data.obj`, `data/true_alpha.obj`, `data/true_mu.obj`, `results/factors_sklearn_whitened.txt`, `results/factors_tlda_whitened.txt`, `results/mu_whitened.txt`, `results/res2.txt`, `results/results_20k.csv`, `results/results_correlation_20k.csv`, `results/whitening_matrix_accuracy.txt`
- **Outputs (examples detected):** `_addM1.pdf`, `data/synthetic_data.obj`, `data/true_alpha.obj`, `data/true_mu.obj`, `results/results_20k.csv`, `results/results_correlation_20k.csv`
- **Requires:** `csv`, `cupy`, `matplotlib`, `numpy`, `pickle`, `random`, `scipy`, `sklearn`, `sys`, `tensorly`, `time`, `version0_15`, `version0_20`

### `08_plot_metoo_tweets.py`
Generates figures/plots for the MeToo tweets experiment.

- **Inputs (examples detected):** `results/tot_tweets_metoo.csv`
- **Outputs (examples detected):** `results/metoo_tweets_over_time.png`
- **Requires:** `matplotlib`, `pandas`

### `09_metoo_top_words.py`
Computes and exports top-N words per topic for the MeToo corpus.

- **Inputs (examples detected):** `data/metoo_experiment/num_tops_10_alpha0_0.001_learning_rate_0.0005_theta_5.005_orthogonality_1000_initialize_first_docs_True_n_eigenvec_40_n_docs_0_no_online/tlda.obj`, `data/metoo_experiment/vocab.csv`
- **Outputs (examples detected):** `results/metoo_topics_tlda.csv`
- **Requires:** `numpy`, `pandas`, `pickle`, `tensorly`



### Election2020
```bash
# A. Batched pipeline and validation
python 01_EF_batched_validation.py  # ensure data split directories exist and constants point to them

# C. Additional coherence/comparisons (optional)
python 03_coherence_comparison.py
python 04_coherence_comparison_large.py
```

## Practical tips
- **GPU memory**: If you hit out-of-memory (OOM) errors, reduce `n_eigenvec`, the vocabulary size, or the file-split/batch size; the code also frees CuPy memory pools between batches.
- **Paths**: Many scripts include hardcoded `ROOT_DIR` and experiment subfolders (e.g., `covid_experiment/`). Adjust before running.
- **PYTHONPATH**: Internal modules (e.g., `version0_20.tlda_final`, `version0_15.tensor_lda_clean`, `version0_99.file_operations`, `lda.tlda.file_operations`) must be importable.
- **Stopwords**: Ensure NLTK stopwords are downloaded; domain lists can be edited within each pipeline to refine token filtering.
- **CPU-only mode**: Some comparison scripts can run with the NumPy backend, but the primary pipelines use CuPy/cuDF/cuML and will be slow or incompatible without a CUDA GPU.
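
The memory-pool reset mentioned in the GPU memory tip can be sketched as follows. This is an assumption-laden illustration rather than the pipelines' own code: `free_gpu_memory` and `iter_batches` are hypothetical helper names, and the loop body is a placeholder. The two CuPy calls, `get_default_memory_pool().free_all_blocks()` and `get_default_pinned_memory_pool().free_all_blocks()`, are the library's standard way to release cached blocks.

```python
def free_gpu_memory():
    """Release cached CuPy memory blocks; return False if nothing was freed."""
    try:
        import cupy as cp
        cp.get_default_memory_pool().free_all_blocks()
        cp.get_default_pinned_memory_pool().free_all_blocks()
        return True
    except Exception:
        # CuPy is missing or no usable CUDA device is present: nothing to free
        return False


def iter_batches(n_rows, batch_size):
    """Yield (start, stop) row ranges covering n_rows in order."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


for start, stop in iter_batches(n_rows=10, batch_size=4):
    # ... process rows start:stop of a sparse matrix here (placeholder) ...
    free_gpu_memory()
```

Smaller `batch_size` values lower peak memory at the cost of more iterations.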

## Folder structure (typical)
```
project-root/
  data/
    Covid/
      data/
        split_files/            # pre-split aggregated inputs
      covid_experiment/
        num_tops_*_.../         # per-run outputs
          x_mat/
          corpus/
          ids/
          predicted_topics/
          tlda.obj
          vocab.csv
          countvec.obj
          dtm.csv
          dtm_df.csv
          coherence.obj
          top_words.csv
    MeToo/
      ...
    Election2020/
      data_split2/, data_split3/, data_split4/  # if using EF script conventions
  src/ (optional)
    version0_20/, version0_15/, version0_99/, lda/tlda/  # internal modules
  01_covid_pipeline.py
  02_metoo_pipeline.py
  01_EF_batched_validation.py
  02_tlda_export.py
  03_covid_top_words.py
  03_coherence_comparison.py
  04_coherence_comparison_large.py
```

## Reproducibility
- Random seeds are set where applicable.
- Outputs are deterministic up to floating point and backend differences.
- Save directories encode hyperparameters in their names for traceability (see pipeline `main()` for grids).
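
As an illustration of how hyperparameters end up in a directory name, the sketch below rebuilds the MeToo run directory that appears in the inputs of `09_metoo_top_words.py`. The pattern is inferred from that single example; the function name `run_dir_name`, the mapping of `theta` and `orthogonality` to `theta_param` and `ortho_loss_param`, and the default `n_docs` and suffix values are assumptions, so consult each pipeline's `main()` for the authoritative format.

```python
def run_dir_name(num_tops, alpha_0, learning_rate, theta_param,
                 ortho_loss_param, initialize_first_docs, n_eigenvec,
                 n_docs=0, suffix="no_online"):
    """Build a run-directory name in the style of the MeToo example path."""
    return (
        f"num_tops_{num_tops}_alpha0_{alpha_0}"
        f"_learning_rate_{learning_rate}_theta_{theta_param}"
        f"_orthogonality_{ortho_loss_param}"
        f"_initialize_first_docs_{initialize_first_docs}"
        f"_n_eigenvec_{n_eigenvec}_n_docs_{n_docs}_{suffix}"
    )


# Values read off the MeToo example directory name
print(run_dir_name(10, 0.001, 0.0005, 5.005, 1000, True, 40))
```

Parsing a run's settings back out of its folder name is then a matter of splitting on these fixed labels.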



---



## R scripts: plotting & descriptive statistics

These R utilities produce the paper-ready figures that accompany the Python TLDA pipelines. To reiterate, they work on **aggregated data only** (CSV summaries, document–term matrices, and derived labels) and should be run **after** you have prepared the corresponding CSV inputs.

### R dependencies
Install the following packages (some are included within the `tidyverse` meta-package).

```r
install.packages(c(
  "tidyverse","data.table","dplyr","ggplot2","scales","lubridate","zoo",
  "tidytext","quanteda","text2vec","parallel","vars","vegan","ggforce",
  "tidyr","xtable","stringr","readr"
))
```

> Tip: Set the working directory to the repository root in R, e.g. `setwd("~/path/to/TLDA-Dev")`.

### `06_timelineplot.R`
Creates a timeline visualization of the **most prominent topic by month** with category coloring (Social, Political, Celebrity, Countermovement).

- **Reads:** `data/TimeLine_Data.csv`
- **Writes:** `plots/prom_topics.pdf` (18.5 x 8.5 inches)
- **Key aesthetics:** vertical sticks from `Rank` to 0, month and year labels, dashed year-end reference lines (2017-12-31, 2018-12-31).
- **Main libraries:** `scales`, `lubridate`, `ggplot2`, `dplyr`, `readr`

**Usage**
```r
setwd("~/path/to/TLDA-Dev")
source("06_timelineplot.R")   # produces plots/prom_topics.pdf
```

### `07_timelineplot_pol.R`
Political-institution timeline for MeToo-related tweets, with prespecified categories {US-President, US-Congress, US-Supreme Court, Canada-PM, US-Policy}.

- **Reads:** `data/timeline_data_pol.csv`
- **Writes:** `plots/prom_topics_pol.pdf` (15 x 7.5 inches)
- **Key differences vs. `06_timelineplot.R`:** different color palette, smaller vertical offset for labels, year labels positioned lower.
- **Main libraries:** `scales`, `lubridate`, `ggplot2`, `dplyr`, `readr`

**Usage**
```r
setwd("~/path/to/TLDA-Dev")
source("07_timelineplot_pol.R")   # produces plots/prom_topics_pol.pdf
```

### `02_descriptive.stats.R`
Generates descriptive figures for the **Election Fraud** track: (i) Tweets per day and (ii) Topic composition (fake news / real news / legal).

- **Reads:** All CSV files under `data_new3/x_label/` (expects columns like `X0`, `X1`, … for topic probabilities and a `date` column)
- **Writes:**
  - `Plots/tweets_per_day.pdf` (16 x 9 inches)
  - `Plots/topical_composition.pdf` (16 x 9 inches)
- **Figures produced:**
  1. **Tweets per Day** (Figure 7): Daily volume (in millions), annotated vertical lines for 2020‑11‑07 and 2021‑01‑10.
  2. **Topic Composition of Election Fraud** (Figure 8): Weighted daily share (%) for three grouped topic sets (fake news, real news, legal) with annotations and event lines (2020‑11‑07, 2021‑01‑06).
- **Main libraries:** `data.table`, `tidyverse` (`dplyr`, `ggplot2`, `tidyr`, `readr`, `stringr`), `lubridate`, `tidytext`, `quanteda`, `text2vec`, `parallel`, `vars`, `vegan`, `ggforce`, `xtable`


Open the `.Rproj` project file in RStudio to set the working directory automatically, which removes the need to hardcode it.

**Usage**
```r
source("02_descriptive.stats.R")  # produces PDFs in Plots/
```

> **Note on output folders:** These scripts write to `plots/` and `Plots/` (note the capitalization). Create both directories or standardize the target directory before running:
>
> ```r
> dir.create("plots", showWarnings = FALSE, recursive = TRUE)
> dir.create("Plots", showWarnings = FALSE, recursive = TRUE)
> ```

### Place in Overall Pipeline
1. Run the Python pipelines to build **aggregated** CSVs and topic labels for each track.
2. Ensure the CSVs required by the R scripts are present in the expected folders (`data/`, `data_new3/x_label/`).
3. Run the R scripts above to generate the figures for inclusion in the paper or slides.
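
Step 2 can be automated with a short pre-flight check run from the repository root. The sketch below only restates the input locations named in the R script descriptions above; the function name `missing_r_inputs` is hypothetical, and the check looks for presence only, not for the expected columns.

```python
from pathlib import Path

# Inputs named in the R script descriptions above
REQUIRED_FILES = ["data/TimeLine_Data.csv", "data/timeline_data_pol.csv"]
REQUIRED_CSV_DIRS = ["data_new3/x_label"]


def missing_r_inputs(root="."):
    """Return the required files and CSV folders that are absent or empty."""
    root = Path(root)
    missing = [f for f in REQUIRED_FILES if not (root / f).is_file()]
    # A folder counts as missing if it contains no CSV files at all
    missing += [d for d in REQUIRED_CSV_DIRS
                if not any((root / d).glob("*.csv"))]
    return missing


if __name__ == "__main__":
    print("missing R inputs:", missing_r_inputs() or "none")
```

An empty result means the three R scripts should at least find their input files.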



## Citation
If you use this data or the package linked above, please cite the accompanying paper:
> *Analyzing Political Text at Scale with Online Tensor LDA*.  
> [The full citation will be added here once the paper is finalized and published.]

For questions or issues, contact the authors.
